‘White Wine Quality’ is a tidy dataset contains 4,898 white wines with 11 variables on quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
1 - fixed acidity (tartaric acid - g / dm^3) 2 - volatile acidity (acetic acid - g / dm^3) 3 - citric acid (g / dm^3) 4 - residual sugar (g / dm^3) 5 - chlorides (sodium chloride - g / dm^3 6 - free sulfur dioxide (mg / dm^3) 7 - total sulfur dioxide (mg / dm^3) 8 - density (g / cm^3) 9 - pH 10 - sulphates (potassium sulphate - g / dm3) 11 - alcohol (% by volume) 12 - quality (score between 0 and 10)
/Which chemical properties influence the quality of white wines?
In this stduy, I will make exploratory data analysis to understand the data before the inferential stastics tests are applied.
The structure of the data:
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
All of the variables are numbers and there exist no factor type in the dataset.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
Quality values are between 3 and 9. Median and Mean is very close to each other which means the distribution is not so-skewed. Lets start with the distribution of the quality outputs.
The quality of the wines are mostly cumulated at value of 6. There exist very few wines having quality score of 9. Since it is scale, there exist no decimal values.
Let’s investigate other variables and their distributions.
The distribution of acidity is very close to normal distribution. But there exist some outliers in the data.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
It can also be seen from the summary table that there exist few data points between 3rd Quantile and Max Values.
if I trim top 1 percentile, I obtain the below graph which is approximately normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
In the dataset, there exist extreme data points which make dataset skewed. I should investigate these extreme points and their possible relationship with the quality.
If I trim the top 1 percentile, then I obtain the below graph.
The distribution is still right skewed.
It is interesting to see that there exist extreme counts at 0.3, which seems to be peak and at around 0.5. Other than 0.5, the citric acid values has a bell shaped distribution. An extra attention should be given 0.5 point.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
In citric acid values, there also exist extreme high values. If the top 1 percentile is trimmed, I obtain the below graph which is much close to normal-like distribution.:
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 19 7 6 2 12 5 6 12 4 12 14 1 19 17 27
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 23 33 27 49 48 70 66 104 83 181 136 219 216 282 223
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 307 200 257 183 225 137 177 134 122 101 117 82 95 37 63
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 46 51 38 39 215 35 25 23 16 19 11 22 13 21 6
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 6 9 14 4 6 8 7 7 7 5 3 9 5 5 41
## 0.78 0.79 0.8 0.81 0.82 0.86 0.88 0.91 0.99 1 1.23 1.66
## 2 2 2 2 2 1 1 2 1 5 1 1
0.49 is a common citric acid value although the other close points are not common.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Residual sugar distribution is highly skewed. There exist very few extremely high values but so many extremely low values.
If the residual.sugar value is log transformed below graph is obtained:
It is interesting to see that the distribution is bimodal. Therefore, two groups of wines exist. One is sweet (around 10) and one is non-sweet(around 1.5). Most probably, people either prefer sweet or non-sweet wines.
##
## 0.6 0.7 0.8 0.9 0.95 1 1.05 1.1 1.15 1.2 1.25 1.3
## 2 7 25 39 4 93 1 146 3 187 3 147
## 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 1.8 1.85 1.9
## 2 184 4 142 2 165 2 99 1 99 3 59
## 1.95 2 2.05 2.1 2.2 2.25 2.3 2.35 2.4 2.5 2.6 2.65
## 2 79 1 51 56 2 42 1 41 40 33 1
## 2.7 2.8 2.85 2.9 3 3.1 3.15 3.2 3.3 3.4 3.5 3.6
## 38 36 1 25 17 17 1 28 23 13 31 22
## 3.7 3.75 3.8 3.85 3.9 3.95 4 4.1 4.2 4.25 4.3 4.35
## 12 2 21 3 17 3 19 17 31 2 19 1
## 4.4 4.45 4.5 4.55 4.6 4.7 4.75 4.8 4.85 4.9 5 5.1
## 14 3 33 2 40 29 5 38 1 35 43 28
## 5.15 5.2 5.25 5.3 5.35 5.4 5.45 5.5 5.55 5.6 5.7 5.8
## 2 29 4 17 2 23 2 13 1 16 30 23
## 5.85 5.9 5.95 6 6.1 6.2 6.3 6.35 6.4 6.5 6.55 6.6
## 2 19 1 23 21 31 39 1 34 26 1 30
## 6.65 6.7 6.75 6.8 6.85 6.9 6.95 7 7.05 7.1 7.2 7.25
## 3 25 1 28 6 20 1 31 2 36 29 2
## 7.3 7.35 7.4 7.45 7.5 7.6 7.7 7.75 7.8 7.85 7.9 7.95
## 19 2 40 1 30 29 34 2 41 1 32 1
## 8 8.1 8.15 8.2 8.25 8.3 8.4 8.45 8.5 8.55 8.6 8.65
## 32 34 1 36 2 31 13 1 24 1 27 1
## 8.7 8.75 8.8 8.9 8.95 9 9.05 9.1 9.15 9.2 9.25 9.3
## 18 2 22 23 1 18 1 17 2 22 2 11
## 9.4 9.5 9.55 9.6 9.65 9.7 9.8 9.85 9.9 10 10.05 10.1
## 10 9 1 18 4 22 16 3 18 18 3 14
## 10.2 10.3 10.4 10.5 10.55 10.6 10.65 10.7 10.8 10.9 11 11.1
## 23 16 25 16 1 22 1 26 17 11 19 18
## 11.2 11.25 11.3 11.4 11.45 11.5 11.6 11.7 11.75 11.8 11.9 11.95
## 18 2 12 14 1 11 15 8 4 35 16 3
## 12 12.05 12.1 12.15 12.2 12.3 12.4 12.5 12.55 12.6 12.7 12.75
## 16 1 21 4 15 13 19 16 2 16 16 1
## 12.8 12.85 12.9 13 13.1 13.15 13.2 13.3 13.4 13.5 13.55 13.6
## 25 4 25 19 23 1 13 16 7 10 3 12
## 13.65 13.7 13.8 13.9 14 14.05 14.1 14.15 14.2 14.3 14.35 14.4
## 4 21 8 18 16 1 4 1 20 17 3 17
## 14.45 14.5 14.55 14.6 14.7 14.75 14.8 14.9 14.95 15 15.1 15.15
## 3 17 3 13 14 2 12 14 2 13 7 1
## 15.2 15.25 15.3 15.4 15.5 15.55 15.6 15.7 15.75 15.8 15.9 16
## 6 1 9 17 11 6 14 9 1 6 2 10
## 16.05 16.1 16.2 16.3 16.4 16.45 16.5 16.55 16.6 16.65 16.7 16.75
## 6 2 7 7 5 1 3 1 2 5 5 2
## 16.8 16.85 16.9 16.95 17 17.05 17.1 17.2 17.3 17.35 17.4 17.45
## 4 4 3 3 1 1 5 9 14 1 2 2
## 17.5 17.55 17.6 17.7 17.75 17.8 17.85 17.9 17.95 18 18.05 18.1
## 8 3 2 1 4 13 5 2 3 2 3 6
## 18.15 18.2 18.3 18.35 18.4 18.5 18.6 18.75 18.8 18.9 18.95 19.1
## 8 3 2 4 1 1 1 4 3 1 3 1
## 19.25 19.3 19.35 19.4 19.45 19.5 19.6 19.8 19.9 19.95 20.15 20.2
## 3 4 1 2 3 2 1 4 1 3 1 2
## 20.3 20.4 20.7 20.8 22 22.6 23.5 26.05 31.6 65.8
## 1 1 2 2 2 1 1 2 2 1
It is difficult to understand the graph using this bin sizes. I should narrow down them, change the x scale to obtain a better visualiztaion. Let’s look at the summary table of chlorides variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The distribution is good but the spread of data is wide. If I omit top 1 percent of data:
Although most of the data points are clustered around 0.05 (3rd quartile), there exist considerable amount of data above 0.05. Let’s log transfrom the data to obtain a better insight.
The distribution of data stays the same. Large amount of data is cumulated around 0.5 and there exist a spread of data having value grater than 0.10
##
## 0.009 0.012 0.013 0.014 0.015 0.016 0.017 0.018 0.019 0.02 0.021 0.022
## 1 1 1 4 4 5 5 10 9 16 19 19
## 0.023 0.024 0.025 0.026 0.027 0.028 0.029 0.03 0.031 0.032 0.033 0.034
## 20 34 30 54 58 85 81 108 107 109 119 168
## 0.035 0.036 0.037 0.038 0.039 0.04 0.041 0.042 0.043 0.044 0.045 0.046
## 130 200 160 167 157 182 147 184 141 201 170 181
## 0.047 0.048 0.049 0.05 0.051 0.052 0.053 0.054 0.055 0.056 0.057 0.058
## 171 174 133 170 115 104 130 99 61 88 68 53
## 0.059 0.06 0.061 0.062 0.063 0.064 0.065 0.066 0.067 0.068 0.069 0.07
## 36 46 19 25 23 15 8 18 18 7 18 6
## 0.071 0.072 0.073 0.074 0.075 0.076 0.077 0.078 0.079 0.08 0.081 0.082
## 5 2 5 8 2 9 1 2 4 4 2 2
## 0.083 0.084 0.085 0.086 0.087 0.088 0.089 0.09 0.091 0.092 0.093 0.094
## 5 5 3 4 3 2 1 2 1 3 3 5
## 0.095 0.096 0.097 0.098 0.099 0.102 0.104 0.105 0.108 0.11 0.112 0.114
## 2 6 1 3 1 1 1 1 2 3 1 1
## 0.115 0.117 0.118 0.119 0.12 0.121 0.122 0.123 0.126 0.127 0.13 0.132
## 1 3 1 3 1 2 1 4 3 2 1 1
## 0.133 0.135 0.136 0.137 0.138 0.142 0.144 0.145 0.146 0.147 0.148 0.149
## 1 1 1 2 2 3 1 1 1 2 1 1
## 0.15 0.152 0.154 0.156 0.157 0.158 0.16 0.167 0.168 0.169 0.17 0.171
## 1 2 1 1 4 1 2 2 3 2 2 1
## 0.172 0.173 0.174 0.175 0.176 0.179 0.18 0.184 0.185 0.186 0.194 0.197
## 2 2 2 2 2 1 1 2 2 1 1 2
## 0.2 0.201 0.204 0.208 0.209 0.211 0.212 0.217 0.239 0.24 0.244 0.255
## 1 2 1 2 1 1 1 1 1 1 1 1
## 0.271 0.29 0.301 0.346
## 1 1 1 1
There exist several data points above 0.01 and these data points have large spread.
Lets arrange binwidths to obtain deeper insight.
If we look also the summary statistics:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
There exist extremely large variables similar to other variables. If the top 1 percentile is omitted:
This time, the distribution is quite better and similar to normal. We can see that only very small amount of data have extreme values. The skewness in the data is very low.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The data follows a pattern similar to previous variables. There exist extremely large variables but most of the data has a bell-shaped normal-like distribution. If the top 1 percentile is omitted the below distribution is obtained.
Most of the data values are between 50 and 240.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The spread of density seems very narrow. Let’s use smaller bin sizes:
Most density values are between 0.98 and 1. There exist also a few extreme values. Lets omit the top 1 percentile.
The most of the density data are cumulated between 0.990 and 0.997
The data distribution seems bell-shaped and there are no extreme values. If the bin size is narowed:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
There is ignorable skewness in the data set and the spread is quite narrow.
The data seem to be right skewed. Let’s narrow the bin sizes.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
If the top 1 percentile is ommited:
There exist still some right-skewness in the data but very close to bell-shaped distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The distribution seems to be multi- modal. These are (8.5 - 10), (10-11.5) and (11.5-13). The biggest group is (8.5-10) group. If the data is log transformed:
Most data exist at the point 9.5. Nearly twice of the other peaks.
##
## 8 9 10 11 12 13 14
## 317 1606 1256 906 675 131 7
There exist 4898 observations and 12 variables in the dataset. All variables are numeric and quality is an inetger type. Most variables have right-skewed distributions with extremely high values.
Sulphate and density also follow the pattern : bell shaped distribution with few extreme high values.
Alcohol has 3 main clusters around 9.5, 10.5 and 12.5.
The main features in the data set are alcohol, volitile acidity and residual sugar. I suspect those features and some combinations of the other variables can be used to build a predictive model.
I think that sulfur, sulphate and choloride values are also important factors to be investigated in the future.
No, I did not.
form of the data? If so, why did you do this?
I made severeal operations on the data set. I have log transformed skewed distribution of residual sugar to visualize the behavior of data better. I also trimmed top 1 percentile of the data points of most variables to visulize the common pattern in the data.
I played with bin sizes to visulize the distribution clearly and define the pattern. Generally, I arranged the bin sizes according to data points’ precision to detect the details. By this way, I have detected the cluster of data around 0.49 in the citric acid feature.
There exist different quality levels for the same amount of sugar. Let’s investigate quality conditional on residual.sugar
Average quality has very high variance conditional on residual.sugar. For very close residual.sugar values, quality changes a lot which means very low correlation. However on average, extreme residual.sugar values have less quality.
When residual sugar is between 1.5 and 5 the quality is robust and highest mean of means.
When residual sugar is between 5 and 10, variance in quality is very high and quality mean reaches very high values. However, mean of means is quite low.
When we look at the mean and mean of means, we can see that there exist a pattern in quality conditional on alcohol. If the extreme values are trimmed:
The trimmed model has a better positive linearrelationship between 9.5 and 13. Best robust qualities are reached between 12 and 13 alcohol level.
##
## Pearson's product-moment correlation
##
## data: alcohol and quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
From the graph, it can be seen that there exists a negative relatinship between volitile acidity and quality. Let’s investigate the data without extreme points.
Trimming the extreme high points decreased the slope, however a negative relatinship can be observed clearly. After 0.5 volitile acidity, the slope (strenght of the relationship) increases.
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and quality
## t = -13.891, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2215214 -0.1676307
## sample estimates:
## cor
## -0.194723
Quality and free.sulfur.dioxide has a positive relationship between 0 and 30. After 40, mean of means decreases and falls down to quality lecel of 6.
total.sulfur.dioxide seems to have a positive relationship with quality betweeen 0 and 90. The curve’s slope becomes negative after 100, but the strenght of relationship is low. It is important to emphasize that quality value is very robust between 75 and 150. For small values of total sulfur dioxide, quality is very volitile.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Cholorides values are mostly cumulated around 0 and 0.1. If we look at the summary table:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
even the third quantile is 0.05. If we trim extreme values and draw quality condional on cholorides:
It is clear that quality is very robust between 0.025 and 0.75 with a negative relationship with chlorides. The volatility increases after 0.10.
It is difficult to say that there exist a relationship between quality and sulphates visually. The only significant increase is around 0.8 sulphates value.
We can also look at citric acid, fixed acidity, density and pH relationships.